Aim

The aim of this package is to simplify RNAseq analysis by integrating some pre-existing widely used packages (DESeq21, topGO2, WGCNA3 and rrvgo5, in a tidyverse6 environment) to obtain results and visualization after performing differential expression analysis, PCA, Gene Ontology enrichment and gene co-expression clustering.

RNAseqEasy workflow

The data used as example in this vignette was generated in the work of Liang et al.7 in the liverwort Marchantia polymorpha.

Standard workflow

Quick start

The main function in RNAseqEasy package is DESeq2_simple(), which performs principal component analysis, differential expression analysis and Gene Ontology enrichment analysis from quantified samples.

DiffExprAn <- DESeq2_simple(
  output_path = output_Dir_DESeq2, # Path to save results
  sampleDir = sampleDir, # Directory with your Salmon quantified data
  sample_table = sample_table, # Data frame with your sample names and variables
  tx2gene = Marchantia7_tx2gene, # Data frame with your Gene - Transcript relationship
  Variable = c("Genotype", "Treatment"), # Variables from sample_table affecting your analysis
  Design = "Genotype + Treatment + Genotype:Treatment", # How to model samples
  Name = "Tak1_OPDA", # Name to include in result files
  Contrast = Tak1_OPDA - Tak1_Mock, 
  geneID2GO = geneID2GO, # GO functional annotation
  orgdb = "org.At.tair.db" # Reference organism from rrvgo database
  )

DESeq2_simple() function general scheme

Input data

The RNAseqEasy package is thought to be used from Salmon mapped and quantified data. The first required input is the path [1] of the directory where quant.sf samples are saved will be the input. The other input necessary for the analysis is a data frame [2] that includes the name of the samples and the information describing them (i.e. variables or conditions defining the experiment).

Salmon data

Salmon4 is a widely used software for quantifying transcript abundance. A well detailed tutorial can be found in their official webpage. It is a very fast tool for transcript quantification directly from fastq.gzfiles. The only requirement is the creation of an index of the transcriptome of the organism we are working with (do not use the genome!!!), also explained in their tutorial.

As a result, we will obtain a folder [1] containing one folder per analyzed sample with its names. In each folder, the main output fail will be named quant.sf, which contains the name of each transcript and their abundance in Transcripts Per Million (TPM), among other information (length, effective length and number of reads). The path to this output folder will be input for our analysis [1].

# URL from Zenodo
data_url <- "https://zenodo.org/records/15800134/files/GSE275561.zip?download=1"

# Temporal directory to download data
output_dir <- file.path(tempdir(), "GSE275561")
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

zip_file <- file.path(tempdir(), "GSE275561.zip")

# Download zip file with Salmon quantified data from Zenodo
download.file(url = data_url, destfile = zip_file, mode = "wb")

# Unzip compressed file
unzip(zip_file, exdir = output_dir)

list.files(output_dir)
## [1] "__MACOSX"              "02_Salmon"             "Sample_Data_Wendy.txt"

So, we set the “02_Salmon” folder as sampleDir, which will direct the analysis to the directory to retrieve quantified data for each sample.

sampleDir <- file.path(output_dir, "02_Salmon")

Sample table

For each experiment analysis, it is required to generate a data frame [2] (we call it sample_table) which correlates the name of each sequenced sample with the variables describing it in the experiment. So, it will include a Sample column including sample names, and an extra column for each variable of factor that were included in the experiment design. In our example, we include 12 different samples (Wendy1 to Wendy12), and there were 3 different variables describing them:

  • Genotype. Samples were divided in two different genotypes: ‘Tak1’ (wild-type organisms) and ‘gh3a’ (CRISPR-Cas9 mutants).
  • Treatment. Two different treatments were applied to samples: ‘Mock’ or ‘OPDA’ (exposure to a high concentration of a plant stress hormone from jasmonic acid family).
  • Replicate. Each category of samples included 3 independent biological replicates.
# Data frame with 'Sample', 'Genotype', 'Treatment and 'Replicate' information

sample_table <- data.frame(
  Sample = paste0("Wendy", seq(1,12)),
  Genotype = rep(c("Tak1", "gh3a"), each = 6),
  Treatment = rep(c("Mock", "OPDA"), each = 3),
  Replicate = seq(1,3)
)

sample_table
##     Sample Genotype Treatment Replicate
## 1   Wendy1     Tak1      Mock         1
## 2   Wendy2     Tak1      Mock         2
## 3   Wendy3     Tak1      Mock         3
## 4   Wendy4     Tak1      OPDA         1
## 5   Wendy5     Tak1      OPDA         2
## 6   Wendy6     Tak1      OPDA         3
## 7   Wendy7     gh3a      Mock         1
## 8   Wendy8     gh3a      Mock         2
## 9   Wendy9     gh3a      Mock         3
## 10 Wendy10     gh3a      OPDA         1
## 11 Wendy11     gh3a      OPDA         2
## 12 Wendy12     gh3a      OPDA         3

In differential expression analysis, comparisons are performed against a reference level. By default, R set levels based on alphabetical order for each variable. We can set the specific comparison we want to perform in each analysis, but it is a good practice to set levels order as we want to before we continue. We transform each variable column into factor and establish the level order. In this example, “TaK-1” is the reference genotype, and “Mock” is the reference treatment.

# Convert 'Genotype' into factor, setting "Tak1" as reference
sample_table$Genotype <- factor(sample_table$Genotype,
                                levels = c("Tak1", "gh3a"))

# Convert 'Treatment' into factor, setting "Mock" as reference
sample_table$Treatment <- factor(sample_table$Treatment, levels = c("Mock", "OPDA"))

str(sample_table)
## 'data.frame':    12 obs. of  4 variables:
##  $ Sample   : chr  "Wendy1" "Wendy2" "Wendy3" "Wendy4" ...
##  $ Genotype : Factor w/ 2 levels "Tak1","gh3a": 1 1 1 1 1 1 2 2 2 2 ...
##  $ Treatment: Factor w/ 2 levels "Mock","OPDA": 1 1 1 2 2 2 1 1 1 2 ...
##  $ Replicate: int  1 2 3 1 2 3 1 2 3 1 ...

Functional annotation

Everything else required for the analysis is not specific of the experiment you are performing, but depends on the organism you are working with. There are two different files that we will need to import for the analysis:

  • A data frame that correlates each transcript name with its corresponding gene name.
  • A file including functional annotation, i.e., correlating each gene/transcript to GO terms.

In this example, we are working with samples extracted from the liverwort Marchantia polymorpha, a model plant widely use to study how ancient or conserved biological processes are in plant evolution. To get the reference transcriptome, we download it from marchantia.info, the Genome Database for Marchantia polymorpha. The get.fasta.name() function from phylotools8 package allows us to obtain the transcript names from the fasta file. In Marchantia, gene names and transcript share the same nomenclature, only adding a dot and a number to the different transcripts of a gene. To correlate them, we create a data frame including a column called Name containing transcript names, and based on that we create a new column called GENEID removing last 2 characters (using str_sub() function from string package).

library(phylotools)
library(tidyverse)
URL <- gzcon(url(paste("https://marchantia.info/download/MpTak_v6.1/", "MpTak_v6.1r1.mrna.fasta.gz", sep="")))
txt <- readLines(URL)
Marchantia7_Transcripts <- get.fasta.name(textConnection(txt), clean_name = FALSE)
head(Marchantia7_Transcripts)
## [1] "Mp1g00070.1" "Mp1g00080.1" "Mp1g00090.1" "Mp1g00090.2" "Mp1g00090.3"
## [6] "Mp1g00090.4"
Marchantia7_tx2gene <- data.frame(Name = Marchantia7_Transcripts) %>%
  tidyr::separate(Name, sep = " ", into = c("TXNAME", "CDS")) %>%
  dplyr::mutate(GENEID = stringr::str_sub(TXNAME, 1, -3)) %>%
  dplyr::select(-CDS)

head(Marchantia7_tx2gene, 10)
##         TXNAME    GENEID
## 1  Mp1g00070.1 Mp1g00070
## 2  Mp1g00080.1 Mp1g00080
## 3  Mp1g00090.1 Mp1g00090
## 4  Mp1g00090.2 Mp1g00090
## 5  Mp1g00090.3 Mp1g00090
## 6  Mp1g00090.4 Mp1g00090
## 7  Mp1g00090.5 Mp1g00090
## 8  Mp1g00090.6 Mp1g00090
## 9  Mp1g00100.1 Mp1g00100
## 10 Mp1g00110.1 Mp1g00110

For this example, functional annotation of Gene Ontologies (GOs) for Marchantia genes can be downloaded from the package in a csv file. Many organisms reference databases include this kind of files. If you cannot find them, there are tools and tutorial to annotate your reference genome.

library(RNAseqEasy)

GO_data <- system.file("extdata",
                      "Mpo6.1_GO_db_GeneGO.csv",
                      package="RNAseqEasy", mustWork=TRUE)
Mpo_GO_GOSLIM <- read.csv(GO_data, sep = "\t")
head(Mpo_GO_GOSLIM)
##   Transcript         GO
## 1  Mp1g00060 GO:0003676
## 2  Mp1g00060 GO:0032774
## 3  Mp1g00060 GO:0034062
## 4  Mp1g00060 GO:0042644
## 5  Mp1g00060 GO:0046914
## 6  Mp1g00070 GO:0000075

This annotation file is a two column data frame correlating transcript names and GO terms (one assotiation per row). For topGO GO enrichement analyses, we will need to convert it into a list containing transcript as keys and all their annotated GO terms as values. The load_topGO_db() function included in this package helps us in this purpose.

geneID2GO <- load_topGO_db(Mpo_GO_GOSLIM, "Transcript", "GO")
head(geneID2GO)
## $Mp1g00060
## [1] "GO:0003676" "GO:0032774" "GO:0034062" "GO:0042644" "GO:0046914"
## 
## $Mp1g00070
##  [1] "GO:0000075" "GO:0000123" "GO:0000725" "GO:0000790" "GO:0000800"
##  [6] "GO:0000976" "GO:0001894" "GO:0003682" "GO:0003700" "GO:0004842"
## [11] "GO:0005704" "GO:0005759" "GO:0005813" "GO:0005829" "GO:0005886"
## [16] "GO:0006302" "GO:0007623" "GO:0008023" "GO:0008157" "GO:0008270"
## [21] "GO:0009653" "GO:0010074" "GO:0010332" "GO:0016458" "GO:0016607"
## [26] "GO:0031057" "GO:0031062" "GO:0031436" "GO:0032409" "GO:0032786"
## [31] "GO:0034243" "GO:0035065" "GO:0036464" "GO:0040029" "GO:0042127"
## [36] "GO:0042800" "GO:0042803" "GO:0042981" "GO:0043970" "GO:0044030"
## [41] "GO:0044093" "GO:0044648" "GO:0044666" "GO:0045322" "GO:0045944"
## [46] "GO:0048872" "GO:0051050" "GO:0051053" "GO:0051569" "GO:0065003"
## [51] "GO:0070531" "GO:0070577" "GO:0071339" "GO:0071479" "GO:0080182"
## [56] "GO:0099402" "GO:1902750" "GO:1903507" "GO:1990904" "GO:2000113"
## [61] "GO:2001025" "GO:2001038"
## 
## $Mp1g00080
## [1] "GO:0005794" "GO:0009570" "GO:0009941"
## 
## $Mp1g00090
## [1] "GO:0005773" "GO:0005829" "GO:0012505" "GO:0031410" "GO:0034272"
## [6] "GO:0098588" "GO:0098805"
## 
## $Mp1g00100
##  [1] "GO:0000137" "GO:0000138" "GO:0005634" "GO:0005768" "GO:0005797"
##  [6] "GO:0005802" "GO:0007155" "GO:0008194" "GO:0009507" "GO:0009832"
## [11] "GO:0010214" "GO:0010395" "GO:0031224" "GO:0045489" "GO:0070592"
## [16] "GO:0080157"
## 
## $Mp1g00110
##  [1] "GO:0000491" "GO:0001094" "GO:0005634" "GO:0005737" "GO:0042802"
##  [6] "GO:0051117" "GO:0070062" "GO:0070761" "GO:0097159" "GO:1901363"

To perform GO semantic clustering, the analysis is based on rrvgo5 and GOSemSim9 packages. The available species for this analysis are included in Bioconductor webpage. The only plant organism included is Arabidopsis thaliana, so we will use its code (org.At.tair.db) for this analysis. We can save this data in an object, so it will save computing time in the following analysis. There are three classes of GO terms: Biological Processes (BP), Molecular Functions (MF) and Cell Components (CC). In this example, we will only be focused on biological processes.

save <- GOSemSim::godata("org.At.tair.db", ont="BP")

Differential gene expression analysis

At this point, we already have everything we need to perform our RNA-seq analysis: a directory with our quantified samples (sampleDir), a data frame with the experiment metadata (sample_table), a data frame correlating gene names and transcript names for our organism (Marchantia7_tx2gene), an annotation list correlating transcript names and their annotated GO terms (geneID2GO) and the model organism reference data for the semantic clustering (save).

We want to obtain differential expressed genes (DEGs) in Tak-1 genotype in OPDA treatment compared to Tak-1 genotype in Mock conditions. To organize our results, we will create a folder to save all differential expression analyses and we will call it “DESeq2”, and inside we will create another folder to save specifically this comparison, and we will call it “Tak1_OPDA”.

dir.create(file.path(output_dir, "DESeq2"))
dir.create(file.path(output_dir, "DESeq2", "Tak1_OPDA"))
output_Dir_DESeq2 <- file.path(output_dir, "DESeq2", "Tak1_OPDA")

To run the analysis, we should install DESeq2 and topGO packages from Bioconductor and load them in our R session.

Now, we are ready to run our differential expression analysis. For that, we will use DESeq2_simple function. We set all required parameters that we previously defined (output_path, sampleDir, sample_table, tx2gene, geneID2GO, orgdb and semdata). But, still, there are some parameters that we have to define for this analysis:

  • Name: We have to include a string to name all files that are going to be generated. In this example, we use the same as the folder where they will be saved (“Tak1_OPDA”).
  • Variable: Vector of variables or factors that are affecting gene expression in the experiment.
  • Design: Design formula indicating the variables which will be used in the modelling, i.e., which variables we expect that are having an effect in gene expression in the analysis. In this example, we expect that “Genotype” itself (there will be genes which expression will be expected to be affected only by genotype in any condition), “Treatment” itself (there will be genes which expression will be expected to be affected only by treatment in any genotype), and the interaction “Genotype+Treatment” ((there will be genes which expression will be expected to be affected differently by treatment in a genotype compared to the other)), will be affecting gene expression. Therefore, our design formula will be “Genotype + Treatment + Genotype:Treatment”. (For more information about design, read the DESeq2 vignette).
  • Contrast: The specific contrast we want to perform to test differential expression. If we do not specify any contrast, the function will stop and there will be printed all possible contrast we can use in the experimental design we are using. There are two different ways to define the contrast we want to:
    • Condition A - Condition B.
    • list(c(“Contrast”)).

For more information about contrasts, read the DESeq2 vignette and check this tutorial.

library(DESeq2)
library(topGO)

DESeq2_simple(
    output_path = output_Dir_DESeq2,
    sampleDir = sampleDir,
    sample_table = sample_table,
    tx2gene = Marchantia7_tx2gene,
    geneID2GO = geneID2GO,
    orgdb = "org.At.tair.db",
    semdata = save,
    Name = "Tak1_OPDA",
    Variable = c("Genotype", "Treatment"),
    Design = "Genotype + Treatment + Genotype:Treatment"
  )
## [1] These are the DESeq2 contrasts, that you have to use like list(c("ContrastA", "ContrastB")) : 
## [1] "Intercept"                  "Genotype_gh3a_vs_Tak1"     
## [3] "Treatment_OPDA_vs_Mock"     "Genotypegh3a.TreatmentOPDA"
## [1] These are the costumized contrasts, that you have to use like ContrastA - ContrastB : 
## [1] "Tak1"      "gh3a"      "Mock"      "OPDA"      "Tak1_Mock" "Tak1_OPDA"
## [7] "gh3a_Mock" "gh3a_OPDA"
## Error in DESeq2_simple(output_path = output_Dir_DESeq2, sampleDir = sampleDir, : Choose one of the previous contrasts

After running DESeq2_simple with no Contrast parameter, there is an error telling us we have to choose one of the previous contrasts printed in the console. In this example, we want to compare Tak-1 genotype in OPDA treatment samples vs. Tak-1 genotype in Mock conditions. There are two different ways we can set this contrast:

  • Tak1_OPDA - Tak1_Mock.
  • list(c(“Treatment_OPDA_vs_Mock”)). Since we established Tak-1 as our reference genotype (first factor), the treatment OPDA vs Mock contrast will get this comparison in the reference genotype, i.e., Tak-1.
Tak1_OPDA_result <- DESeq2_simple(
    output_path = output_Dir_DESeq2,
    sampleDir = sampleDir,
    sample_table = sample_table,
    Include = NULL,
    Exclude = NULL,
    tx2gene = Marchantia7_tx2gene,
    Variable = c("Genotype", "Treatment"),
    Design = "Genotype + Treatment + Genotype:Treatment",
    Group = "NO",
    Name = "Tak1_OPDA",
    Contrast = Tak1_OPDA - Tak1_Mock,
    Reduced = FALSE,
    log2FCtopGO = 1,
    geneID2GO = geneID2GO,
    ontology = "BP",
    plot_similarity = TRUE,
    orgdb = "org.At.tair.db",
    semdata = save
  )

If we want to perform other possible analyses, these would be the contrast we would have to use:

  • OPDA vs Mock in gh3a genotype: gh3a_OPDA - gh3a_Mock, or list(c(“Treatment_OPDA_vs_Mock”, “Genotypegh3a.TreatmentOPDA”)).

  • gh3a vs Tak-1 in Mock conditions: gh3a_Mock - Tak1_Mock, or list(c(“Genotype_gh3a_vs_Tak1”)).

  • gh3a vs Tak-1 in OPDA conditions: gh3a_OPDA - Tak1_OPDA, or list(c(“Genotype_gh3a_vs_Tak1”, “Genotypegh3a.TreatmentOPDA”)).

  • Genes affected by the Genotype:Treatment interaction: (gh3a_OPDA - gh3a_Mock) - (Tak1_OPDA - Tak1_Mock), or list(c(“Genotypegh3a.TreatmentOPDA”)). This are genes that will be responding to OPDA treatment in the gh3a genotype in a different way that the expected from the response in Tak-1 genotype.

dir.create(file.path(output_dir, "DESeq2", "Genotype_Treatment_interaction"))
output_Dir_interaction <- file.path(output_dir, "DESeq2", "Genotype_Treatment_interaction")
Genotype_Treatment_result <- DESeq2_simple(
    output_path = output_Dir_interaction,
    sampleDir = sampleDir,
    sample_table = sample_table,
    Include = NULL,
    Exclude = NULL,
    tx2gene = Marchantia7_tx2gene,
    Variable = c("Genotype", "Treatment"),
    Design = "Genotype + Treatment + Genotype:Treatment",
    Group = "NO",
    Name = "Genotype_Treatment_interaction",
    Contrast = list(c("Genotypegh3a.TreatmentOPDA")),
    Reduced = FALSE,
    log2FCtopGO = 1,
    geneID2GO = geneID2GO,
    ontology = "BP",
    plot_similarity = TRUE,
    orgdb = "org.At.tair.db",
    semdata = save
  )

Sample selection

By default, all samples in sampleDir will be included for the differential expression analysis using DESeq2_simple function. However, it could be interesting to perform an analysis including only a subset of these samples. For example, let’s think about the scenario where we want to study DEGs by OPDA in Tak-1 genotype ignoring the existence of gh3a in the experiment. Thus, we would only want to include samples belonging to Tak-1 genotype. There are two possible ways of doing that in DESeq2_simple function: by using Include or Exclude parameters.

  • Include: we define features in a vector that all samples must present to be included in the analysis. In this example, it would be Include = c("Tak1").
  • Exclude: we define features in a vector that we want to remove from the analysis. All samples containing any of these features will be removed. In this example, it would be Exclude = c("gh3a").

In our example, both parameters would work the same way, i.e., only including Tak-1 samples in the analysis, both in OPDA and Mock treatments.

PCA

Apart from differential expression analysis, all samples included in the DESeq2_simple function will be compute for principal component analysis and we will obtain a png file of PC1 vs PC2 using 500 top genes by default. If we do not want to generate the PCA, we can set the PCA parameter to FALSE withing the DESeq2_simple function. If we want to perform a different principal component analysis (plotting other component, using a different number of top genes, changing colors, etc…) we can do it by running the run_pca_analysis function.

Output

After running DESeq2_simple function, there are some files generated with results from the differential gene expression analysis:

  • Two tab-delimited text files are generated including results from the differential expression analysis of all genes (Name.txt) and only significantly expressed genes, i.e., \(padj \le 0.05\) (Name_Sig.txt).
##            baseMean log2FoldChange     lfcSE       stat       pvalue
## Mp1g00070  147.7289    -0.87073377 0.2215426 -3.9303223 8.483207e-05
## Mp1g00080 8077.3196    -0.54069976 0.1329801 -4.0660199 4.782285e-05
## Mp1g00090  531.2465    -0.05551592 0.1356234 -0.4093389 6.822909e-01
## Mp1g00100 1127.5365     0.87159546 0.1302033  6.6941140 2.169823e-11
## Mp1g00110  419.9734    -0.10280589 0.1763512 -0.5829610 5.599196e-01
## Mp1g00120  511.5859    -0.35720864 0.1180411 -3.0261372 2.476999e-03
##                   padj
## Mp1g00070 3.932574e-04
## Mp1g00080 2.335654e-04
## Mp1g00090 7.772935e-01
## Mp1g00100 2.389950e-10
## Mp1g00110 6.791640e-01
## Mp1g00120 7.944981e-03
  • A PDF file including a heatmap of \(log2(n + 1)\) normalized values of all samples included in the analysis (Name_heatmap.pdf).
## Converting page 1 to Tak1_OPDA_heatmap_1.png... done!
Heatmap of normalized values of DEGs in the OPDA vs Mock comparison for Tak-1 genotype

Heatmap of normalized values of DEGs in the OPDA vs Mock comparison for Tak-1 genotype

  • A png file including the PCA of all samples included in the analysis, only if PCA parameter is set as the TRUE default value (PCA_Name.png).
PCA plot of PC1 vs PC2, using 500 top genes

PCA plot of PC1 vs PC2, using 500 top genes

  • A “topGO” folder, including results of GO enrichment analysis for Up, Down and Up+Down DEGs (\(padj \le 0.05\), and \(log2(FC) \ge 1\) for Up-regulated and \(log2(FC) \le -1\) for Down-regulated genes):
    • Two tab-delimited text files including results from the GO enrichment analysis. One include detailed results (topGO_Name_All/Up/Down_results.txt) and the other only enriched GO terms and their p-values (topGO_Name_All/Up/Down_pvals.txt).
##            Annotated Significant Expected
## GO:0044238      6226         509   528.04
## GO:0045893       688          59    58.35
## GO:0048574        90           9     7.63
## GO:0010154      1454         150   123.32
## GO:0044706       462          48    39.18
## GO:0044281      1878         198   159.28
##                                                          Term         pval
## GO:0044238                          primary metabolic process 0.0061337475
## GO:0045893 positive regulation of DNA-templated transcription 0.0060918982
## GO:0048574                 long-day photoperiodism, flowering 0.0049341244
## GO:0010154                                  fruit development 0.0002647838
## GO:0044706               multi-multicellular organism process 0.0321583675
## GO:0044281                   small molecule metabolic process 0.0091339926
##            Enrichment
## GO:0044238   1.006982
## GO:0045893   1.056274
## GO:0048574   1.231723
## GO:0010154   1.270691
## GO:0044706   1.279712
## GO:0044281   1.298621
  • Three PDF files for GO enrichment visualization:
    • A tree map plot of semantic-clustered GO terms (topGO_Name_All/Up/Down_Treemap.pdf).

    • A scatter plot of semantic-clustered GO terms (topGO_Name_All/Up/Down_Scatterplot.pdf).

    • A bubble plot of all enriched GO terms in descendant order according to their enrichment (topGO_Name_All/Up/Down_bubble.pdf).

Tree map of Gene Ontologies enriched in Up-regulated genes

Tree map of Gene Ontologies enriched in Up-regulated genes

Semantic cluster of Gene Ontologies enriched in Up-regulated genes

Semantic cluster of Gene Ontologies enriched in Up-regulated genes

Bubble plot of Gene Ontologies enriched in Up-regulated genes

Bubble plot of Gene Ontologies enriched in Up-regulated genes

Clustering of co-expressed genes

WGCNA_Module() function general scheme

Another application of the package is the generation of co-expressed gene modules by the implementation of components from the WGCNA3 package. This functionality is included in the WGCNA_Modules function. As DESeq2_simple function, it requires to specify the output_path, sampleDir, sample_table, Variables, tx2gene, Name, geneID2GO and Annotation parameters. The specific parameters that have also to be included in this function are:

  • DEGs: A vector of Gene IDs. This will be the input set of genes that we want to cluster. The Gene IDs must correspond to the transcriptome that we are using in the analysis.
  • Power: Soft thresholding power used in the network construction (check the WGNA manual or tutorial for more info. If we run the WGCNA_Modules function without the Power parameter specified, it will stops and generate a “softthresholdingpower.pdf” file, that can help us to select the power. As an alternative, this table suggest the power to select depending on the number of samples and type of network: Suggested Soft thresholding powers

Most of the advanced parameters in for the blockwiseModules function of the WGCNA package for the module generation are set to default.

As an example, we will use the output DEGs from the Genotype:Treatment interaction in DESeq2_simple, i.e., genes that significantly respond to the OPDA treatment in the gh3a genotype in a different way than the Tak-1 genotype. This different response can be in different ways, so we will use the gene co-expression modules to explore this responses.

DESeq2_simple_output <- sigtable <- read.table(file.path(output_Dir_interaction, "Genotype_Treatment_interaction.txt"), header = TRUE)

## We get rownames (GeneIDs) of the output data frame from DESeq2_simple(), applying a filter of log2FC > 1 for Up-regulated DEGs and log2FC < -1 for Down-regulation.

DEGs <- rownames(DESeq2_simple_output %>% filter(log2FoldChange > 1 | log2FoldChange < -1))

Now that we already have selected our input gene set, we can run the WGCNA_Modules function:

library(WGCNA)
dir.create(file.path(output_dir, "WGCNA"))
output_WGCNA <- file.path(output_dir, "WGCNA")
COIOPDA_int_Tak1OPDA_WGCNA <- WGCNA_Modules(output_path =output_WGCNA,
                                   sampleDir = sampleDir,
                                   sample_table = sample_table,
                                   DEGs = DEGs,
                                   Variables = c("Genotype", "Treatment"),
                                   tx2gene = Marchantia7_tx2gene,
                                   Name = "Genotype_Treatment_Interaction",
                                   NumberCol = 1,
                                   geneID2GO = geneID2GO,
                                   semdata = save)
##  Flagging genes and samples with too many missing values...
##   ..step 1
## pickSoftThreshold: will use block size 886.
##  pickSoftThreshold: calculating connectivity for given powers...
##    ..working on genes 1 through 886 of 886
## Warning: executing %dopar% sequentially: no parallel backend registered
##    Power SFT.R.sq  slope truncated.R.sq mean.k. median.k. max.k.
## 1      1    0.263  0.783          0.143  331.00   327.000  483.0
## 2      2    0.403 -0.390          0.352  179.00   158.000  345.0
## 3      3    0.762 -0.636          0.874  115.00    86.100  274.0
## 4      4    0.825 -0.726          0.933   81.60    50.200  230.0
## 5      5    0.857 -0.769          0.957   61.60    31.700  199.0
## 6      6    0.886 -0.799          0.980   48.60    21.000  175.0
## 7      7    0.883 -0.823          0.945   39.60    15.400  158.0
## 8      8    0.908 -0.837          0.958   33.10    11.700  143.0
## 9      9    0.919 -0.840          0.963   28.20     8.820  131.0
## 10    10    0.930 -0.856          0.966   24.40     6.890  121.0
## 11    12    0.935 -0.865          0.970   18.90     4.310  105.0
## 12    14    0.944 -0.874          0.969   15.10     2.870   92.3
## 13    16    0.956 -0.880          0.979   12.50     1.980   82.4
## 14    18    0.955 -0.878          0.973   10.50     1.430   74.4
## 15    20    0.952 -0.898          0.969    8.96     1.090   67.6
## 16    22    0.962 -0.905          0.970    7.76     0.806   62.0
## 17    24    0.968 -0.905          0.970    6.79     0.597   57.1
## 18    26    0.975 -0.900          0.979    6.00     0.444   53.1
## 19    28    0.973 -0.905          0.970    5.35     0.360   49.5
## 20    30    0.976 -0.919          0.973    4.80     0.278   46.3
## 21    32    0.977 -0.930          0.972    4.33     0.218   43.5
## 22    34    0.982 -0.939          0.979    3.93     0.174   41.0
## 23    36    0.977 -0.931          0.973    3.58     0.137   38.7
## 24    38    0.985 -0.934          0.982    3.28     0.109   36.6
## 25    40    0.974 -0.936          0.968    3.02     0.086   34.7
## Error in WGCNA_Modules(output_path = output_WGCNA, sampleDir = sampleDir, : Choose a power based on softthresholdingpower.pdf

We did not include the Power parameter, so the function stopped at warned us to select it based on the “softthresholdingpower.pdf” file.

R^2 values as a function of the soft thresholds

R^2 values as a function of the soft thresholds

Based on this plot, we will select 7 as the soft thresholding power since it is the first value that rises the threshold.

COIOPDA_int_Tak1OPDA_WGCNA <- WGCNA_Modules(output_path =output_WGCNA,
                                   sampleDir = sampleDir,
                                   sample_table = sample_table,
                                   DEGs = DEGs,
                                   Variables = c("Genotype", "Treatment"),
                                   tx2gene = Marchantia7_tx2gene,
                                   Name = "Genotype_Treatment_Interaction",
                                   Power = 7,
                                   NumberCol = 1,
                                   geneID2GO = geneID2GO,
                                   semdata = save)
##  Flagging genes and samples with too many missing values...
##   ..step 1
## pickSoftThreshold: will use block size 886.
##  pickSoftThreshold: calculating connectivity for given powers...
##    ..working on genes 1 through 886 of 886
##    Power SFT.R.sq  slope truncated.R.sq mean.k. median.k. max.k.
## 1      1    0.263  0.783          0.143  331.00   327.000  483.0
## 2      2    0.403 -0.390          0.352  179.00   158.000  345.0
## 3      3    0.762 -0.636          0.874  115.00    86.100  274.0
## 4      4    0.825 -0.726          0.933   81.60    50.200  230.0
## 5      5    0.857 -0.769          0.957   61.60    31.700  199.0
## 6      6    0.886 -0.799          0.980   48.60    21.000  175.0
## 7      7    0.883 -0.823          0.945   39.60    15.400  158.0
## 8      8    0.908 -0.837          0.958   33.10    11.700  143.0
## 9      9    0.919 -0.840          0.963   28.20     8.820  131.0
## 10    10    0.930 -0.856          0.966   24.40     6.890  121.0
## 11    12    0.935 -0.865          0.970   18.90     4.310  105.0
## 12    14    0.944 -0.874          0.969   15.10     2.870   92.3
## 13    16    0.956 -0.880          0.979   12.50     1.980   82.4
## 14    18    0.955 -0.878          0.973   10.50     1.430   74.4
## 15    20    0.952 -0.898          0.969    8.96     1.090   67.6
## 16    22    0.962 -0.905          0.970    7.76     0.806   62.0
## 17    24    0.968 -0.905          0.970    6.79     0.597   57.1
## 18    26    0.975 -0.900          0.979    6.00     0.444   53.1
## 19    28    0.973 -0.905          0.970    5.35     0.360   49.5
## 20    30    0.976 -0.919          0.973    4.80     0.278   46.3
## 21    32    0.977 -0.930          0.972    4.33     0.218   43.5
## 22    34    0.982 -0.939          0.979    3.93     0.174   41.0
## 23    36    0.977 -0.931          0.973    3.58     0.137   38.7
## 24    38    0.985 -0.934          0.982    3.28     0.109   36.6
## 25    40    0.974 -0.936          0.968    3.02     0.086   34.7
##  Calculating module eigengenes block-wise from all genes
##    Flagging genes and samples with too many missing values...
##     ..step 1
##  ..Working on block 1 .
##     TOM calculation: adjacency..
##     ..will not use multithreading.
##      Fraction of slow calculations: 0.000000
##     ..connectivity..
##     ..matrix multiplication (system BLAS)..
##     ..normalization..
##     ..done.
##    ..saving TOM for block 1 into file RNAseqEasy-block.1.RData
##  ....clustering..
##  ....detecting modules..
##  ....calculating module eigengenes..
##  ....checking kME in modules..
##      ..removing 8 genes from module 1 because their KME is too low.
##      ..removing 2 genes from module 2 because their KME is too low.
##      ..removing 5 genes from module 3 because their KME is too low.
##      ..removing 1 genes from module 4 because their KME is too low.
##  ..merging modules that are too close..
##      mergeCloseModules: Merging modules whose distance is less than 0.25
##        Calculating new MEs...
## Warning in RColorBrewer::brewer.pal(length(unique_groups), "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in plot_topGO_results(go_results, title = name, output_path =
## paste0(output_prefix, : No GO results to plot.
## Warning in rrvgo::calculateSimMatrix(terms, orgdb = orgdb, ont = ontology, : No terms were found in orgdb for BP
## Check that the organism and ontology match the ones provided by orgdb
## Warning in analyze_GO_similarity(go_results, orgdb = orgdb, ontology =
## ontology, : Not enough valid GO terms or similarity matrix could not be
## computed. Skipping visualization.

There are 10 different clusters that have been generated after the analysis. The average behavior of genes from each category are summarized in the “Summary_Modules_Name.pdf” file:

Summary of gene expression behavior in each co-expression module of genes responding to the Genotype:Treatment interaction

Summary of gene expression behavior in each co-expression module of genes responding to the Genotype:Treatment interaction

topGO_All() function general scheme

1.
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. 15, 550 (2014).
2.
Alexa, A. & Rahnenfuhrer, J. topGO: Enrichment analysis for gene ontology. (2021).
3.
Langfelder, P. & Horvath, S. WGCNA: An r package for weighted correlation network analysis. 559 (2008).
4.
Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods 14, 417–419 (2017).
5.
Sayols, S. Rrvgo: A bioconductor package for interpreting lists of gene ontology terms. (2023) doi:10.17912/MICROPUB.BIOLOGY.000811.
6.
Wickham, H. et al. Welcome to the tidyverse. Journal of Open Source Software 4, 1686 (2019).
7.
8.
9.
Yu, G. Gene Ontology Semantic Similarity Analysis Using GOSemSim. in 207–215 (Springer US, 2020). doi:10.1007/978-1-0716-0301-7_11.